Use narwhals as dataframe-agnostic backend #7964
base: main
Conversation
Pull Request Overview
This PR introduces narwhals as a dataframe-agnostic backend for PyMC, enabling support for multiple dataframe libraries (pandas, polars, dask.dataframe) through a unified interface. The implementation uses a registration mechanism to handle different dataframe backends dynamically.
Key changes:
- Replaces direct pandas usage with narwhals compatibility layer
- Implements singledispatch-based coordinate determination for different data types
- Adds support for polars and dask.dataframe in addition to pandas
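The singledispatch-based coordinate determination mentioned above can be sketched roughly as follows. This is a toy illustration, not the PR's actual code: `determine_coords`, `FrameLike`, and the returned dict shape are all assumptions made for the example.

```python
from functools import singledispatch


@singledispatch
def determine_coords(value):
    # Fallback: plain sequences/arrays carry no coordinate information.
    return {}


class FrameLike:
    """Hypothetical stand-in for a dataframe-like backend object."""

    def __init__(self, columns):
        self.columns = columns  # dict of column name -> list of values


@determine_coords.register
def _(value: FrameLike):
    # A dataframe-like object contributes its column labels as coords.
    return {"column": list(value.columns)}


frame = FrameLike({"USA": [1, 2], "Chile": [3, 4]})
print(determine_coords(frame))   # {'column': ['USA', 'Chile']}
print(determine_coords([1, 2]))  # {}
```

Registering a new backend then only requires one more `@determine_coords.register` function, which is the extension point the PR relies on.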
Reviewed Changes
Copilot reviewed 11 out of 12 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| pymc/pytensorf.py | Implements dataframe backend registration and tensor conversion using narwhals |
| pymc/data.py | Refactors coordinate determination with singledispatch and narwhals integration |
| tests/test_pytensorf.py | Extends tests to cover pandas, polars, and dask.dataframe |
| tests/test_data.py | Adds polars-specific tests for series and dataframe coordinate inference |
| requirements.txt | Adds narwhals>=2.11.0 as a core dependency |
| requirements-dev.txt | Adds narwhals>=2.11.0 to development dependencies |
| conda-envs/*.yml | Adds narwhals>=2.11.0 to all conda environment files |
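The "dataframe backend registration and tensor conversion" in `pymc/pytensorf.py` could look roughly like the registry sketched below. Everything here is a hypothetical stand-in (`register_backend`, `frame_to_rows`, `ToyFrame`); the real implementation goes through narwhals rather than a hand-rolled registry.

```python
# Hypothetical backend registry: each registered frame type supplies a
# converter turning its frame into plain rows a tensor constructor
# could consume.
_CONVERTERS = {}


def register_backend(frame_type, converter):
    _CONVERTERS[frame_type] = converter


def frame_to_rows(frame):
    for frame_type, converter in _CONVERTERS.items():
        if isinstance(frame, frame_type):
            return converter(frame)
    raise TypeError(f"no registered backend for {type(frame).__name__}")


class ToyFrame:
    """Toy 'backend' storing columns as name -> list."""

    def __init__(self, columns):
        self.columns = columns


# Transpose column-major storage into row-major output.
register_backend(
    ToyFrame,
    lambda f: [list(row) for row in zip(*f.columns.values())],
)

rows = frame_to_rows(ToyFrame({"x": [1, 2], "y": [3, 4]}))
print(rows)  # [[1, 3], [2, 4]]
```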
Force-pushed 72cfa96 to 9bcd9de
Codecov Report
❌ Patch coverage is
Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #7964      +/-   ##
==========================================
- Coverage   91.49%   91.45%   -0.04%
==========================================
  Files         116      116
  Lines       18962    19016      +54
==========================================
+ Hits        17349    17391      +42
- Misses       1613     1625      +12
```
Force-pushed 9501ac3 to 48dcc85
I love this! I will be reviewing this soon :)
Review comment on `def _dataframe_agnostic_coords(` (nearby context: `return coords, _handle_none_dims(dims, value.ndim)`):
Maybe we can add a simple test for this function?
I added a test for this. The new test revealed that my code for inferring an "index column" in the dataframe based on the left-most dim was broken.
What I was trying to do was allow for an "index column" in backends with no index (Polars). So if your dims were (date, country) and you passed in a polars dataframe with columns [USA, Chile, Mozambique, date], it would drop the "date" column, use it as a coord, then pass the rest in as the data to pm.Data. But this ended up being something of an API break, and I didn't like that we ended up mutating the user inputs silently in the background. I removed it, so only a pandas dataframe will give you index coords when infer_coords_and_dims=True.
Frankly though I kind of hate this feature, it's too magical. Just force people to write out their coords.
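For reference, the removed "index column" behavior described above could be sketched like this. It is a toy illustration only (`split_index_column` and the dict-of-lists frame are invented for the example; the real code operated on actual dataframes):

```python
def split_index_column(columns, dims):
    """If a column matches the left-most dim, pop it out as coords."""
    columns = dict(columns)  # copy so the caller's input is not mutated
    index_dim = dims[0]
    if index_dim in columns:
        coords = {index_dim: columns.pop(index_dim)}
    else:
        coords = {}
    return coords, columns


coords, data = split_index_column(
    {"USA": [1, 2], "Chile": [3, 4], "date": ["2024-01", "2024-02"]},
    ("date", "country"),
)
print(coords)        # {'date': ['2024-01', '2024-02']}
print(sorted(data))  # ['Chile', 'USA']
```

The silent-mutation concern is visible even in this sketch: the caller's frame and the data actually passed to `pm.Data` no longer have the same columns.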
juanitorduz left a comment
LGTM. Just a small comment / request
Force-pushed a8b264e to ff7af55
Description
Continues/completes #7462. I didn't have permission to push into that PR, so I'm opening this one.
The purpose of this PR is to use narwhals as a one-stop-shop dataframe backend. Currently, we use pandas in `data.py` and `pytensorf.py` to allow users to pass dataframe objects into `pm.Data` and `pt.as_tensor`, respectively. I add a narwhals compatibility layer between the input and the PyMC model to allow users to bring their data in any form that narwhals supports, provided we register the libraries. (If we could eliminate registration that would also be great, but I wasn't clever enough to figure out the multiple dispatch using only narwhals as a dependency. Maybe @MarcoGorelli could help 👉 👈)
Some notes:
- Backends with no index (e.g. polars) cannot provide index coords to `pm.Data`. Instead, we look for a column matching that dimension name. If it is found, it is used as labels, and excluded from the values.
- Lazy inputs arrive as `nw.LazyFrame` and `nw.LazySeries`. I don't think we can do anything with those at the modeling level (maybe in the future with minibatching?). For now, I'm just calling `.collect()` on them to make them eager.
- I added a `dask.dataframe` backend as an example of how we could extend things.

As of this PR, `pandas` could be made an optional or dev-only dependency for us. I didn't do it right away because I wanted to take people's temperature on the idea.
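The "make lazy inputs eager" step from the notes can be sketched with simple duck typing: anything exposing a `.collect()` method (as `nw.LazyFrame` does) is collected before use. `StubLazyFrame` and `ensure_eager` below are stand-ins invented for the example, not the PR's code.

```python
class StubLazyFrame:
    """Stand-in for a lazy frame exposing .collect()."""

    def __init__(self, data):
        self._data = data

    def collect(self):
        return self._data  # return the eager payload


def ensure_eager(frame):
    # Collect lazy frames; pass eager inputs through unchanged.
    if hasattr(frame, "collect"):
        return frame.collect()
    return frame


print(ensure_eager(StubLazyFrame({"x": [1, 2]})))  # {'x': [1, 2]}
print(ensure_eager({"y": [3]}))                    # {'y': [3]}
```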